The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.
Customers’ leaving credit cards services would lead bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and reason for same – so that bank could improve upon those areas
You as a Data scientist at Thera bank need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.
# Library to suppress warnings or deprecation notes
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
BaggingClassifier,
RandomForestClassifier,
)
from xgboost import XGBClassifier
from sklearn.dummy import DummyClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
ConfusionMatrixDisplay,
make_scorer,
)
# To impute missing values and handle modeling
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# To build a logistic regression model
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek
from imblearn.under_sampling import ClusterCentroids
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.svm import SVC
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import TransformerMixin
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To supress warnings
import warnings
warnings.filterwarnings("ignore")
!pip install lightgbm
import lightgbm as lgb
Requirement already satisfied: lightgbm in /usr/local/lib/python3.10/dist-packages (4.1.0) Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from lightgbm) (1.23.5) Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from lightgbm) (1.11.3)
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
card = pd.read_csv('/content/drive/My Drive/Project 3/BankChurners.csv')
data = card.copy()
data.head(10)
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
| 5 | 713061558 | Existing Customer | 44 | M | 2 | Graduate | Married | $40K - $60K | Blue | 36 | 3 | 1 | 2 | 4010.000 | 1247 | 2763.000 | 1.376 | 1088 | 24 | 0.846 | 0.311 |
| 6 | 810347208 | Existing Customer | 51 | M | 4 | NaN | Married | $120K + | Gold | 46 | 6 | 1 | 3 | 34516.000 | 2264 | 32252.000 | 1.975 | 1330 | 31 | 0.722 | 0.066 |
| 7 | 818906208 | Existing Customer | 32 | M | 0 | High School | NaN | $60K - $80K | Silver | 27 | 2 | 2 | 2 | 29081.000 | 1396 | 27685.000 | 2.204 | 1538 | 36 | 0.714 | 0.048 |
| 8 | 710930508 | Existing Customer | 37 | M | 3 | Uneducated | Single | $60K - $80K | Blue | 36 | 5 | 2 | 0 | 22352.000 | 2517 | 19835.000 | 3.355 | 1350 | 24 | 1.182 | 0.113 |
| 9 | 719661558 | Existing Customer | 48 | M | 2 | Graduate | Single | $80K - $120K | Blue | 36 | 6 | 3 | 3 | 11656.000 | 1677 | 9979.000 | 1.524 | 1441 | 32 | 0.882 | 0.144 |
data.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.000 | NaN | NaN | NaN | 739177606.334 | 36903783.450 | 708082083.000 | 713036770.500 | 717926358.000 | 773143533.000 | 828343083.000 |
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Customer_Age | 10127.000 | NaN | NaN | NaN | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Gender | 10127 | 2 | F | 5358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Dependent_count | 10127.000 | NaN | NaN | NaN | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Education_Level | 8608 | 6 | Graduate | 3128 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Marital_Status | 9378 | 3 | Married | 4687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Income_Category | 10127 | 6 | Less than $40K | 3561 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Card_Category | 10127 | 4 | Blue | 9436 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Months_on_book | 10127.000 | NaN | NaN | NaN | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | NaN | NaN | NaN | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | NaN | NaN | NaN | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | NaN | NaN | NaN | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | NaN | NaN | NaN | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | NaN | NaN | NaN | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | NaN | NaN | NaN | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | NaN | NaN | NaN | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | NaN | NaN | NaN | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | NaN | NaN | NaN | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | NaN | NaN | NaN | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | NaN | NaN | NaN | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
Observations:
Age has a large range of values i.e. 26 to 73.
Credit limit is between USD1438 to USD34516
More indept analysis is required.
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
Observation:
# Checking for the percentage of missing values in the dataset to corroborate the above statement
(data.isnull().sum() / data.isnull().count() * 100)
CLIENTNUM 0.000 Attrition_Flag 0.000 Customer_Age 0.000 Gender 0.000 Dependent_count 0.000 Education_Level 15.000 Marital_Status 7.396 Income_Category 0.000 Card_Category 0.000 Months_on_book 0.000 Total_Relationship_Count 0.000 Months_Inactive_12_mon 0.000 Contacts_Count_12_mon 0.000 Credit_Limit 0.000 Total_Revolving_Bal 0.000 Avg_Open_To_Buy 0.000 Total_Amt_Chng_Q4_Q1 0.000 Total_Trans_Amt 0.000 Total_Trans_Ct 0.000 Total_Ct_Chng_Q4_Q1 0.000 Avg_Utilization_Ratio 0.000 dtype: float64
Observation:
As it has been shown in the previous lectures, these values will be imputed after splitting the data into train,validation and test sets appropriately.
# let's check for duplicate values in the data
data.duplicated().sum()
0
Observation:
There are no duplicated values in the dataset
# Showing the unique values of each columns
data.nunique()
CLIENTNUM 10127 Attrition_Flag 2 Customer_Age 45 Gender 2 Dependent_count 6 Education_Level 6 Marital_Status 3 Income_Category 6 Card_Category 4 Months_on_book 44 Total_Relationship_Count 6 Months_Inactive_12_mon 7 Contacts_Count_12_mon 7 Credit_Limit 6205 Total_Revolving_Bal 1974 Avg_Open_To_Buy 6813 Total_Amt_Chng_Q4_Q1 1158 Total_Trans_Amt 5033 Total_Trans_Ct 126 Total_Ct_Chng_Q4_Q1 830 Avg_Utilization_Ratio 964 dtype: int64
Observation:
# for each of the categorical columns, lets see the unique values in them
for i in data.columns:
#print(i)
if data[i].dtype==object:
print(data[i].unique())
['Existing Customer' 'Attrited Customer'] ['M' 'F'] ['High School' 'Graduate' 'Uneducated' nan 'College' 'Post-Graduate' 'Doctorate'] ['Married' 'Single' nan 'Divorced'] ['$60K - $80K' 'Less than $40K' '$80K - $120K' '$40K - $60K' '$120K +' 'abc'] ['Blue' 'Gold' 'Silver' 'Platinum']
Observation:
This shows the unique values contained in the string values in the following columns
# statistical summary of the non-numerical columns in the dataset
data.describe(exclude=np.number).T
| count | unique | top | freq | |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
Observation:
We can drop the column - CLIENTNUM because it's a unique number for each customer and will not add value to the model building.
# Dropping column - ID
data.drop(columns=["CLIENTNUM"], inplace=True)
Checking the count of each unique category in each of the categorical variables.
# Making a list of all catrgorical variables
cat_col = data.select_dtypes("object").columns.to_list()
# Printing number of count of each unique value in each column
for column in cat_col:
print(data[column].value_counts())
print("*" * 40)
Existing Customer 8500 Attrited Customer 1627 Name: Attrition_Flag, dtype: int64 **************************************** F 5358 M 4769 Name: Gender, dtype: int64 **************************************** Graduate 3128 High School 2013 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64 **************************************** Married 4687 Single 3943 Divorced 748 Name: Marital_Status, dtype: int64 **************************************** Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: Income_Category, dtype: int64 **************************************** Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card_Category, dtype: int64 ****************************************
# Get value counts
data["Attrition_Flag"].value_counts()
Existing Customer 8500 Attrited Customer 1627 Name: Attrition_Flag, dtype: int64
Observation:
We have more female customers as compared to male customers.
# viewing incorrect values in income column
data[data.Income_Category == "abc"]
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | Existing Customer | 45 | F | 2 | Graduate | Married | abc | Blue | 37 | 6 | 1 | 2 | 14470.000 | 1157 | 13313.000 | 0.966 | 1207 | 21 | 0.909 | 0.080 |
| 28 | Existing Customer | 44 | F | 3 | Uneducated | Single | abc | Blue | 34 | 5 | 2 | 2 | 10100.000 | 0 | 10100.000 | 0.525 | 1052 | 18 | 1.571 | 0.000 |
| 39 | Attrited Customer | 66 | F | 0 | Doctorate | Married | abc | Blue | 56 | 5 | 4 | 3 | 7882.000 | 605 | 7277.000 | 1.052 | 704 | 16 | 0.143 | 0.077 |
| 44 | Existing Customer | 38 | F | 4 | Graduate | Single | abc | Blue | 28 | 2 | 3 | 3 | 9830.000 | 2055 | 7775.000 | 0.977 | 1042 | 23 | 0.917 | 0.209 |
| 58 | Existing Customer | 44 | F | 5 | Graduate | Married | abc | Blue | 35 | 4 | 1 | 2 | 6273.000 | 978 | 5295.000 | 2.275 | 1359 | 25 | 1.083 | 0.156 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10021 | Attrited Customer | 30 | F | 1 | Graduate | Married | abc | Blue | 18 | 4 | 1 | 4 | 4377.000 | 2517 | 1860.000 | 0.941 | 8759 | 74 | 0.609 | 0.575 |
| 10040 | Attrited Customer | 50 | F | 3 | Doctorate | Single | abc | Blue | 36 | 4 | 3 | 3 | 5173.000 | 0 | 5173.000 | 0.912 | 8757 | 68 | 0.789 | 0.000 |
| 10083 | Existing Customer | 42 | F | 4 | Uneducated | Married | abc | Blue | 23 | 4 | 1 | 2 | 8348.000 | 0 | 8348.000 | 0.695 | 15905 | 111 | 0.708 | 0.000 |
| 10092 | Attrited Customer | 40 | F | 3 | Graduate | Married | abc | Blue | 25 | 1 | 2 | 3 | 6888.000 | 1878 | 5010.000 | 1.059 | 9038 | 64 | 0.829 | 0.273 |
| 10119 | Attrited Customer | 55 | F | 3 | Uneducated | Single | abc | Blue | 47 | 4 | 3 | 3 | 14657.000 | 2517 | 12140.000 | 0.166 | 6009 | 53 | 0.514 | 0.172 |
1112 rows × 20 columns
# replace values with missing
data.Income_Category.replace(to_replace="abc", value=np.nan, inplace=True)
# subset to view incorrect values
data[data.Income_Category == "abc"]
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|
# using MODE to fill the missing values in income column
data.Income_Category = data.Income_Category.fillna(value=data["Income_Category"].value_counts().index[0]
)
# replace values with missing values in income column
data.Income_Category.replace(to_replace="abc", value=np.nan, inplace=False)
0 $60K - $80K
1 Less than $40K
2 $80K - $120K
3 Less than $40K
4 $60K - $80K
...
10122 $40K - $60K
10123 $40K - $60K
10124 Less than $40K
10125 $40K - $60K
10126 Less than $40K
Name: Income_Category, Length: 10127, dtype: object
# checking income value replacement
data.Income_Category.value_counts()
Less than $40K 4673 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 $120K + 727 Name: Income_Category, dtype: int64
The income category column mode of Less than $40k have been used to fill all the "abc" values in the column.
Questions:
total_ct_change_Q4_Q1) vary by the customer's account status (Attrition_Flag)?Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# Observations on Customer_age
histogram_boxplot(data, "Customer_Age")
Observation:
# observations on credit limit
histogram_boxplot(data, "Credit_Limit")
Observation:
# observations on revolving balance
histogram_boxplot(data, "Total_Revolving_Bal")
Observation:
# observations on dependents counts
histogram_boxplot(data, "Dependent_count")
Observation:
# Checking 10 largest values of amount of credit limit
data.Total_Revolving_Bal.nlargest(10)
3 2517 8 2517 18 2517 32 2517 42 2517 53 2517 76 2517 131 2517 133 2517 139 2517 Name: Total_Revolving_Bal, dtype: int64
# observations on MntFishProducts
histogram_boxplot(data, "Total_Revolving_Bal")
Observation:
# observations on Avg_Open_To_Buy
histogram_boxplot(data, "Avg_Open_To_Buy")
Observation:
# observations on Total_Trans_Amt
histogram_boxplot(data, "Total_Trans_Amt")
Observation:
# observations on Total_Trans_Ct
histogram_boxplot(data, "Total_Trans_Ct")
Observations:
# observations on Total_Ct_Chng_Q4_Q1
histogram_boxplot(data, "Total_Ct_Chng_Q4_Q1")
Observation:
# observations on Total_Amt_Chng_Q4_Q1
histogram_boxplot(data, "Total_Amt_Chng_Q4_Q1")
Observation;
# observations on Avg_Utilization_Ratio
histogram_boxplot(data, "Avg_Utilization_Ratio")
Observation:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 6))
else:
plt.figure(figsize=(n + 1, 6))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# observations on Attrition_Flag
labeled_barplot(data, "Attrition_Flag")
Observations:
# observations on Gender
labeled_barplot(data, "Gender")
Observation:
# observations on education level
labeled_barplot(data, "Education_Level")
Observation:
# observations on card category
labeled_barplot(data, "Card_Category")
Observation:
# observations on marital status
labeled_barplot(data, "Marital_Status")
Observation:
# observations on Contact counts with the last 12 months
labeled_barplot(data, "Contacts_Count_12_mon")
Observation:
# box plots on 2 variables
plt.figure(figsize=(10, 7))
plt.title('BoxPlot:Attrition Categories as related to the Age of the Customers')
sns.boxplot(x="Attrition_Flag", y="Customer_Age", data=data, orient="vertical");
Observation:
plt.figure(figsize=(10, 7))
plt.title('BoxPlot:Attrition Categories as related to the Total_Revolving_Balance of the Customers')
sns.boxplot(x="Attrition_Flag", y="Total_Revolving_Bal", data=data, orient="vertical");
Observation:
plt.figure(figsize=(10, 7))
plt.title('BoxPlot:Attrition Categories as related to the Ave_Open_To_Buy of the Customers')
sns.boxplot(x="Attrition_Flag", y="Avg_Open_To_Buy", data=data);
Observation:
plt.figure(figsize=(10, 7))
plt.title('BoxPlot:Attrition Categories as related to the Ave_Open_To_Buy of the Customers')
sns.boxplot(x="Attrition_Flag", y="Dependent_count", data=data);
Observation:
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 80)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 3))
plt.legend(
loc="lower left", frameon=False,
)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 7))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target variable=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of target for target variable=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
stacked_barplot(data, "Customer_Age", "Attrition_Flag");
Attrition_Flag Attrited Customer Existing Customer All Customer_Age All 1627 8500 10127 43 85 388 473 48 85 387 472 44 84 416 500 46 82 408 490 45 79 407 486 49 79 416 495 47 76 403 479 41 76 303 379 50 71 381 452 54 69 238 307 40 64 297 361 42 62 364 426 53 59 328 387 52 58 318 376 51 58 340 398 55 51 228 279 39 48 285 333 38 47 256 303 56 43 219 262 59 40 117 157 37 37 223 260 57 33 190 223 58 24 133 157 36 24 197 221 35 21 163 184 33 20 107 127 34 19 127 146 32 17 89 106 61 17 76 93 62 17 76 93 30 15 55 70 31 13 78 91 60 13 114 127 65 9 92 101 63 8 57 65 29 7 49 56 26 6 72 78 64 5 38 43 27 3 29 32 28 1 28 29 66 1 1 2 68 1 1 2 67 0 4 4 70 0 1 1 73 0 1 1 --------------------------------------------------------------------------------
Observation:
distribution_plot_wrt_target(data, "Customer_Age", "Attrition_Flag");
Observation:
stacked_barplot(data, "Gender", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Gender All 1627 8500 10127 F 930 4428 5358 M 697 4072 4769 --------------------------------------------------------------------------------
Observation:
stacked_barplot(data, "Card_Category", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Card_Category All 1627 8500 10127 Blue 1519 7917 9436 Silver 82 473 555 Gold 21 95 116 Platinum 5 15 20 --------------------------------------------------------------------------------
Observation:
stacked_barplot(data, "Dependent_count", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Dependent_count All 1627 8500 10127 3 482 2250 2732 2 417 2238 2655 1 269 1569 1838 4 260 1314 1574 0 135 769 904 5 64 360 424 --------------------------------------------------------------------------------
Observation:
stacked_barplot(data, "Marital_Status", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Marital_Status All 1498 7880 9378 Married 709 3978 4687 Single 668 3275 3943 Divorced 121 627 748 --------------------------------------------------------------------------------
Observation:
stacked_barplot(data, "Education_Level", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Education_Level All 1371 7237 8608 Graduate 487 2641 3128 High School 306 1707 2013 Uneducated 237 1250 1487 College 154 859 1013 Doctorate 95 356 451 Post-Graduate 92 424 516 --------------------------------------------------------------------------------
Observation:
stacked_barplot(data, "Income_Category", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Income_Category All 1627 8500 10127 Less than $40K 799 3874 4673 $40K - $60K 271 1519 1790 $80K - $120K 242 1293 1535 $60K - $80K 189 1213 1402 $120K + 126 601 727 --------------------------------------------------------------------------------
Observation:
The dataset show 5 categories of income levels.
Income level of less than 40,000 is more likely to churn than others.
distribution_plot_wrt_target(data, "Months_on_book", "Attrition_Flag")
Observation:
stacked_barplot(data, "Total_Relationship_Count", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Total_Relationship_Count All 1627 8500 10127 3 400 1905 2305 2 346 897 1243 1 233 677 910 5 227 1664 1891 4 225 1687 1912 6 196 1670 1866 --------------------------------------------------------------------------------
ObservationL
stacked_barplot(data, "Months_Inactive_12_mon", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Months_Inactive_12_mon All 1627 8500 10127 3 826 3020 3846 2 505 2777 3282 4 130 305 435 1 100 2133 2233 5 32 146 178 6 19 105 124 0 15 14 29 --------------------------------------------------------------------------------
Observation:
stacked_barplot(data, "Contacts_Count_12_mon", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Contacts_Count_12_mon All 1627 8500 10127 3 681 2699 3380 2 403 2824 3227 4 315 1077 1392 1 108 1391 1499 5 59 117 176 6 54 0 54 0 7 392 399 --------------------------------------------------------------------------------
Observation:
distribution_plot_wrt_target(data, "Credit_Limit", "Attrition_Flag")
distribution_plot_wrt_target(data, "Total_Revolving_Bal", "Attrition_Flag")
distribution_plot_wrt_target(data, "Avg_Open_To_Buy", "Attrition_Flag")
distribution_plot_wrt_target(data, "Total_Amt_Chng_Q4_Q1", "Attrition_Flag")
distribution_plot_wrt_target(data, "Total_Trans_Amt", "Attrition_Flag")
Observation:
This variable have a bi-normal distribution and with 4 peaks.
The distribution of values is well spread within the variable.
The boxplot shows outliers in the variable and the outliers is quite large.
Most of the customers behaved outside the ordinary observations in the variable.
distribution_plot_wrt_target(data, "Total_Trans_Ct", "Attrition_Flag")
Observation:
This variable have a normal distribution and its it slighly skewed to the right.
The boxplot shows outliers in the variable.
Few customers behaved outside the ordinary observations in the variable.
The normalized boxplot without outliers shows that the attrited customers customers have less total tarnsaction counts than existing ones.
distribution_plot_wrt_target(data, "Avg_Utilization_Ratio", "Attrition_Flag")
Observation:
This variable have a not normal distribution and its it heavily skewed to the right.
Both existing and attrited customers have very low average utilization ratio.
The boxplot with resoect to our target variable shows outliers in the variable and the outliers is quite large.
Most of the customers behaved outside the ordinary observations in the variable.
There is very low next to thing values in the boxplot without outliers shows that the average utilization is 0
distribution_plot_wrt_target(data, "Total_Ct_Chng_Q4_Q1", "Attrition_Flag")
Observation:
This variable have a normal distribution and its it slighly skewed to the right.
The boxplot shows outliers in the variable and the outliers is quite large.
Most of the customers behaved outside the ordinary observations in the variable.
The normalized boxplot without outliers shows that the percentage change is arounf 0.55%
plt.figure(figsize=(15,10))
sns.heatmap(data.corr(),annot=True,cmap='Spectral')
plt.show()
Observation:
Some of the variable in the distribution have noticiable positive and negative correlations
(1) Customer_Age and Months_on_book is positively correlated - 79%
(2) Avg_Utilization_Ratio and Total_Revolving_Bal - 62%
(3) Total_Trans_Amt and Total_Trans_Ct - 81% (inter alia)
(1) Avg_Utilization_Ratio and Avg_Open_to_Buy - 54%
(2) Avg_Utilization_Ratio and Credit_Limit - 48%
(3) Total_Relationship_Count and Total_Trans_Amount 35% (inter alia)
# pairplot using Attrition flag, the dependent variable as emphasis
plt.figure(figsize=(15,7))
sns.pairplot(data, hue="Attrition_Flag")
plt.show()
plt.savefig("output.jpg", bbox_inches='tight')
<Figure size 1500x700 with 0 Axes>
<Figure size 640x480 with 0 Axes>
Observation:
There are overlaps i.e. no clear distinction in the distribution of variables for people who have attrited and did not attrit.
# labeling the target variable - Attrition
data.Attrition_Flag = data.Attrition_Flag.replace(
to_replace={"Attrited Customer": 1, "Existing Customer": 0}
)
data.Attrition_Flag
0 0
1 0
2 0
3 0
4 0
..
10122 0
10123 1
10124 1
10125 1
10126 1
Name: Attrition_Flag, Length: 10127, dtype: int64
# cross-checking the above label
data.sample(10)
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9210 | 0 | 49 | F | 3 | NaN | Married | Less than $40K | Blue | 36 | 1 | 1 | 3 | 3767.000 | 1106 | 2661.000 | 0.944 | 15578 | 126 | 0.826 | 0.294 |
| 6586 | 0 | 45 | F | 4 | High School | Single | Less than $40K | Blue | 35 | 3 | 2 | 2 | 1502.000 | 0 | 1502.000 | 0.739 | 4890 | 61 | 0.794 | 0.000 |
| 618 | 0 | 38 | F | 3 | High School | Married | Less than $40K | Blue | 29 | 4 | 2 | 2 | 11327.000 | 1528 | 9799.000 | 1.142 | 2033 | 38 | 0.727 | 0.135 |
| 5067 | 1 | 48 | F | 2 | College | Married | Less than $40K | Blue | 37 | 5 | 2 | 4 | 2220.000 | 0 | 2220.000 | 0.737 | 2456 | 45 | 0.452 | 0.000 |
| 402 | 0 | 48 | M | 3 | College | Married | $80K - $120K | Blue | 42 | 3 | 3 | 0 | 7477.000 | 0 | 7477.000 | 0.511 | 1159 | 32 | 0.684 | 0.000 |
| 3063 | 0 | 35 | M | 5 | Uneducated | Single | $60K - $80K | Blue | 28 | 4 | 2 | 3 | 4587.000 | 1704 | 2883.000 | 0.903 | 2968 | 65 | 0.806 | 0.371 |
| 2331 | 0 | 50 | F | 3 | High School | Single | Less than $40K | Blue | 38 | 3 | 3 | 1 | 1438.300 | 663 | 775.300 | 0.384 | 2987 | 73 | 0.622 | 0.461 |
| 6015 | 0 | 48 | F | 3 | High School | Married | $40K - $60K | Blue | 38 | 3 | 2 | 2 | 2151.000 | 0 | 2151.000 | 0.763 | 4604 | 86 | 0.720 | 0.000 |
| 5141 | 0 | 46 | F | 2 | High School | Single | Less than $40K | Blue | 36 | 3 | 3 | 4 | 2283.000 | 1020 | 1263.000 | 0.944 | 3687 | 68 | 1.125 | 0.447 |
| 6981 | 0 | 57 | F | 2 | Post-Graduate | Single | $40K - $60K | Blue | 36 | 3 | 4 | 2 | 4784.000 | 1629 | 3155.000 | 0.706 | 4912 | 82 | 0.783 | 0.341 |
# Separating target variable and other variables in the dataset
X = data.drop(columns="Attrition_Flag")
# make dependent variable
Y = data["Attrition_Flag"]
# check sample of the columns
X.sample(7)
| Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4587 | 45 | M | 4 | High School | Single | $80K - $120K | Blue | 34 | 3 | 2 | 3 | 15913.000 | 1696 | 14217.000 | 0.632 | 4472 | 86 | 0.792 | 0.107 |
| 2680 | 52 | F | 4 | NaN | Married | Less than $40K | Blue | 36 | 1 | 4 | 5 | 2143.000 | 0 | 2143.000 | 0.853 | 1353 | 36 | 0.333 | 0.000 |
| 3351 | 46 | F | 2 | High School | NaN | Less than $40K | Blue | 32 | 4 | 1 | 3 | 7966.000 | 1423 | 6543.000 | 0.987 | 3780 | 75 | 0.923 | 0.179 |
| 3045 | 33 | M | 4 | High School | Single | $40K - $60K | Blue | 25 | 4 | 1 | 1 | 8650.000 | 1929 | 6721.000 | 0.777 | 2878 | 63 | 1.032 | 0.223 |
| 7513 | 46 | F | 2 | Graduate | Single | Less than $40K | Blue | 31 | 5 | 2 | 3 | 8243.000 | 927 | 7316.000 | 0.568 | 4503 | 90 | 0.765 | 0.112 |
| 5643 | 50 | M | 0 | Post-Graduate | Married | $80K - $120K | Blue | 36 | 5 | 1 | 3 | 28253.000 | 0 | 28253.000 | 0.479 | 4525 | 67 | 0.763 | 0.000 |
| 7915 | 32 | F | 0 | NaN | Divorced | Less than $40K | Blue | 36 | 6 | 3 | 3 | 1732.000 | 0 | 1732.000 | 0.683 | 2479 | 43 | 0.536 | 0.000 |
# setting dummy variables for processing
X = pd.get_dummies(data=data,
columns=[
"Gender",
"Education_Level", # this variables has missing values
"Marital_Status", # this variable has missing values
"Income_Category",
"Card_Category",
],
drop_first=True,
)
# drop target variable
X = X.drop(columns="Attrition_Flag")
# check sample heads of the data
X.sample(7)
| Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender_M | Education_Level_Doctorate | Education_Level_Graduate | Education_Level_High School | Education_Level_Post-Graduate | Education_Level_Uneducated | Marital_Status_Married | Marital_Status_Single | Income_Category_$40K - $60K | Income_Category_$60K - $80K | Income_Category_$80K - $120K | Income_Category_Less than $40K | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 753 | 26 | 0 | 19 | 5 | 2 | 2 | 2187.000 | 1597 | 590.000 | 0.777 | 2560 | 46 | 0.314 | 0.730 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1320 | 34 | 2 | 21 | 3 | 3 | 3 | 6259.000 | 1063 | 5196.000 | 0.668 | 1836 | 50 | 0.515 | 0.170 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5717 | 43 | 2 | 39 | 3 | 1 | 2 | 18524.000 | 1736 | 16788.000 | 0.762 | 4496 | 64 | 0.882 | 0.094 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4991 | 43 | 4 | 31 | 6 | 1 | 3 | 6467.000 | 1887 | 4580.000 | 0.442 | 4098 | 61 | 0.605 | 0.292 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6107 | 48 | 2 | 37 | 4 | 2 | 1 | 3378.000 | 0 | 3378.000 | 0.580 | 4761 | 81 | 0.884 | 0.000 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4239 | 43 | 2 | 35 | 3 | 3 | 4 | 6109.000 | 1549 | 4560.000 | 0.683 | 4430 | 69 | 0.865 | 0.254 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5862 | 49 | 1 | 31 | 5 | 2 | 1 | 5714.000 | 868 | 4846.000 | 0.636 | 4331 | 85 | 0.700 | 0.152 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
# print the percentage of each unique value in the Y variable series
Y.value_counts() / 100
0 85.000 1 16.270 Name: Attrition_Flag, dtype: float64
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, Y, test_size=0.20, random_state=1, stratify=Y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp,
y_temp,
test_size=0.25,
random_state=1,
stratify=y_temp, # set the weighting feature on
)
print("The Train Set:",X_train.shape)
print("The Validation Set:", X_val.shape)
print("The Testing Set:", X_test.shape)
The Train Set: (6075, 29) The Validation Set: (2026, 29) The Testing Set: (2026, 29)
# showing the missing values
data.isna().sum()
Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
# view missing data for the affected variables
data[data.Education_Level.isnull()]
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0 | 51 | M | 4 | NaN | Married | $120K + | Gold | 46 | 6 | 1 | 3 | 34516.000 | 2264 | 32252.000 | 1.975 | 1330 | 31 | 0.722 | 0.066 |
| 11 | 0 | 65 | M | 1 | NaN | Married | $40K - $60K | Blue | 54 | 6 | 2 | 3 | 9095.000 | 1587 | 7508.000 | 1.433 | 1314 | 26 | 1.364 | 0.174 |
| 15 | 0 | 44 | M | 4 | NaN | NaN | $80K - $120K | Blue | 37 | 5 | 1 | 2 | 4234.000 | 972 | 3262.000 | 1.707 | 1348 | 27 | 1.700 | 0.230 |
| 17 | 0 | 41 | M | 3 | NaN | Married | $80K - $120K | Blue | 34 | 4 | 4 | 1 | 13535.000 | 1291 | 12244.000 | 0.653 | 1028 | 21 | 1.625 | 0.095 |
| 23 | 0 | 47 | F | 4 | NaN | Single | Less than $40K | Blue | 36 | 3 | 3 | 2 | 2492.000 | 1560 | 932.000 | 0.573 | 1126 | 23 | 0.353 | 0.626 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10090 | 0 | 36 | F | 3 | NaN | Married | $40K - $60K | Blue | 22 | 5 | 3 | 3 | 12958.000 | 2273 | 10685.000 | 0.608 | 15681 | 96 | 0.627 | 0.175 |
| 10094 | 0 | 59 | M | 1 | NaN | Single | $60K - $80K | Blue | 48 | 3 | 1 | 2 | 7288.000 | 0 | 7288.000 | 0.640 | 14873 | 120 | 0.714 | 0.000 |
| 10095 | 0 | 46 | M | 3 | NaN | Married | $80K - $120K | Blue | 33 | 4 | 1 | 3 | 34516.000 | 1099 | 33417.000 | 0.816 | 15490 | 110 | 0.618 | 0.032 |
| 10118 | 1 | 50 | M | 1 | NaN | NaN | $80K - $120K | Blue | 36 | 6 | 3 | 4 | 9959.000 | 952 | 9007.000 | 0.825 | 10310 | 63 | 1.100 | 0.096 |
| 10123 | 1 | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.000 | 2186 | 2091.000 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
1519 rows × 20 columns
# Let us now impute the missing values
imputer = SimpleImputer(missing_values=np.nan, strategy="median")
impute = imputer.fit(X_train)
# impute on train, validation set and test set
X_train = impute.transform(X_train)
X_val = imputer.transform(X_val)
X_test = imputer.transform(X_test)
# check if the missing values have been removed from the training set
np.nan in X_train
False
# let us get column count
len(X_train.T)
29
The nature of predictions made by the classification model will translate as follows:
Which metric to optimize?
Model can make wrong predictions as:
Predicting a customer close their credit card account and the customer doesn't - Resources Loss
Predicting a customer will not close their credit card account and the customer closed it - Opportunity Loss
Which case is more important?
How to reduce this loss i.e need to reduce False Negatives?
The Company would like Recall to be maximized, the greater the Recall, the lower the chances of false negatives.
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1
},
index=[0],
)
return df_perf
Sample code for model building with original data
models = [] # Empty list to store all the models
# Appending models into the list
models.append(
(
"DecisionTree",
DecisionTreeClassifier(random_state=1,class_weight={0: 15, 1: 85}),
)
)
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss"),))
models.append(("Gradient Boost", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
print("\n" "Training Performance:" "\n")
print("*" * 40)
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_train, model.predict(X_train))
print("{}: {}".format(name, scores))
print("-" * 40)
print("\n" "Validation Performance:" "\n")
print("*" * 40)
for name, model in models:
model.fit(X_train, y_train)
scores_val = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores_val))
# loop through all models to get the mean cross validated score
print("-" * 40)
print("\n" "Cross-Validation Performance:" "\n")
print("*" * 40)
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=10, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
model.fit(X_train, y_train)
print("{}: {}".format(name, cv_result.mean() * 100))
Training Performance: **************************************** DecisionTree: 1.0 Bagging: 0.9866803278688525 Random forest: 1.0 Xgboost: 1.0 Gradient Boost: 0.8780737704918032 Adaboost: 0.8381147540983607 ---------------------------------------- Validation Performance: **************************************** DecisionTree: 0.803680981595092 Bagging: 0.7975460122699386 Random forest: 0.8190184049079755 Xgboost: 0.8834355828220859 Gradient Boost: 0.8588957055214724 Adaboost: 0.8588957055214724 ---------------------------------------- Cross-Validation Performance: **************************************** DecisionTree: 75.41237113402062 Bagging: 78.99852724594993 Random forest: 75.92047128129602 Xgboost: 86.37071323374711 Gradient Boost: 82.17126025667997 Adaboost: 81.24658110666948
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("CV Comparison of the Models define above")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Observation on the Comparison Box Plot above:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perform = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perform
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
con_mtx = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / con_mtx.flatten().sum())]
for item in con_mtx.flatten()
]
).reshape(2, 2)
# plot the matrix inside a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(con_mtx, annot=labels, fmt="")
plt.ylabel("Observed label")
plt.xlabel("Predicted label")
%%time
# defining model
model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
"init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'init': AdaBoostClassifier(random_state=1), 'learning_rate': 0.1, 'max_features': 0.7, 'n_estimators': 100, 'subsample': 0.9}
Score: 0.8155363683935113
CPU times: user 8.05 s, sys: 947 ms, total: 9 s
Wall time: 7min 4s
# extract best estimator from the model
gbgscv = grid_cv.best_estimator_
# fit the model
gbgscv=gbgscv.fit(X_train, y_train)
print(gbgscv)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.7, random_state=1, subsample=0.9)
# Calculating different metrics on train set
print("Training performance:")
gradient_grid_train = model_performance_classification_sklearn(gbgscv, X_train, y_train)
display(gradient_grid_train)
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.973 | 0.874 | 0.955 | 0.913 |
# creating confusion matrix
confusion_matrix_sklearn(gbgscv, X_train, y_train)
# Calculating different metrics on validation set
print("Validation performance:")
gradient_grid_val = model_performance_classification_sklearn(gbgscv, X_val, y_val)
display(gradient_grid_val)
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.970 | 0.868 | 0.943 | 0.904 |
# creating confusion matrix
confusion_matrix_sklearn(gbgscv, X_val, y_val)
Observation:
%%time
# defining model
model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=3, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 100, 'max_features': 0.5, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.8155733836715431:
CPU times: user 2.87 s, sys: 192 ms, total: 3.06 s
Wall time: 1min 29s
# extract best estimator from the model
gbrs = randomized_cv.best_estimator_
# Fit the model on training data
gbrs=gbrs.fit(X_train, y_train)
print(gbrs)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.5, random_state=1, subsample=0.9)
# Calculating different metrics on train set
print("Training performance:")
gradient_random_train = model_performance_classification_sklearn(gbrs, X_train, y_train)
display(gradient_random_train)
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.972 | 0.866 | 0.955 | 0.908 |
# creating confusion matrix
confusion_matrix_sklearn(gbrs, X_train, y_train)
# Calculating different metrics on validation set
print("Validation performance:")
gradient_random_val = model_performance_classification_sklearn(gbrs, X_val, y_val)
display(gradient_random_val)
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.968 | 0.856 | 0.943 | 0.897 |
# creating confusion matrix
confusion_matrix_sklearn(gbrs, X_val, y_val)
Observation:
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=3, n_jobs = -1)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1), 'learning_rate': 0.1, 'n_estimators': 100}
Score: 0.8370992606575429
CPU times: user 3.01 s, sys: 113 ms, total: 3.13 s
Wall time: 57.9 s
# get best model
adb_tuned1 = grid_cv.best_estimator_
# fit the model
adb_tuned1=adb_tuned1.fit(X_train, y_train)
print(adb_tuned1)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.1, n_estimators=100, random_state=1)
# Calculating different metrics on train set
Adaboost_grid_train = model_performance_classification_sklearn(
adb_tuned1, X_train, y_train
)
print("Training performance:")
Adaboost_grid_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.983 | 0.926 | 0.966 | 0.946 |
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned1, X_train, y_train)
# Calculating different metrics on validation set
Adaboost_grid_val = model_performance_classification_sklearn(adb_tuned1, X_val, y_val)
print("Validation performance:")
Adaboost_grid_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.971 | 0.874 | 0.941 | 0.906 |
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned1, X_val, y_val)
Observation:
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=3, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.8370992606575429:
CPU times: user 3.6 s, sys: 105 ms, total: 3.71 s
Wall time: 57.2 s
# building model with best parameters
adb_tuned2 = randomized_cv.best_estimator_
# Fit the model on training data
adb_tuned2=adb_tuned2.fit(X_train, y_train)
print(adb_tuned2)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.1, n_estimators=100, random_state=1)
# Calculating different metrics on train set
Adaboost_random_train = model_performance_classification_sklearn(
adb_tuned2, X_train, y_train
)
print("Training performance:")
Adaboost_random_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.983 | 0.926 | 0.966 | 0.946 |
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned2, X_train, y_train)
# Calculating different metrics on validation set
Adaboost_random_val = model_performance_classification_sklearn(adb_tuned2, X_val, y_val)
print("Validation performance:")
Adaboost_random_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.971 | 0.874 | 0.941 | 0.906 |
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned2, X_val, y_val)
Observation:
%%time
#defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')
#Parameter grid to pass in GridSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=3, n_jobs = -1, verbose= 2)
#Fitting parameters in GridSeachCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best parameters are {'gamma': 3, 'learning_rate': 0.05, 'n_estimators': 100, 'scale_pos_weight': 5, 'subsample': 0.7} with CV score=0.9180116407110273:
CPU times: user 2.59 s, sys: 147 ms, total: 2.73 s
Wall time: 1min 11s
# building model with best parameters
xgb_tuned1 = grid_cv.best_estimator_
# Fit the model on training data
xgb_tuned1.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=3, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.05, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=100,
n_jobs=None, num_parallel_tree=None, random_state=1, ...)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=3, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.05, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=100,
n_jobs=None, num_parallel_tree=None, random_state=1, ...)# Calculating different metrics on train set
xgboost_grid_train = model_performance_classification_sklearn(
xgb_tuned1, X_train, y_train
)
print("Training performance:")
xgboost_grid_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.980 | 0.998 | 0.890 | 0.941 |
# creating confusion matrix
confusion_matrix_sklearn(xgb_tuned1, X_train, y_train)
# Calculating different metrics on validation set
xgboost_grid_val = model_performance_classification_sklearn(xgb_tuned1, X_val, y_val)
print("Validation performance:")
xgboost_grid_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.959 | 0.942 | 0.825 | 0.880 |
# creating confusion matrix
confusion_matrix_sklearn(xgb_tuned1, X_val, y_val)
Observation:
%%time
# defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=3, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 5, 'n_estimators': 75, 'learning_rate': 0.05, 'gamma': 3} with CV score=0.9149347176341042:
CPU times: user 1.49 s, sys: 88 ms, total: 1.58 s
Wall time: 30.3 s
# building model with best parameters
xgb_tuned2 = randomized_cv.best_estimator_
# Fit the model on training data
xgb_tuned2.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=3, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.05, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=75,
n_jobs=None, num_parallel_tree=None, random_state=1, ...)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=3, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.05, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=75,
n_jobs=None, num_parallel_tree=None, random_state=1, ...)# Calculating different metrics on train set
xgboost_random_train = model_performance_classification_sklearn(
xgb_tuned2, X_train, y_train
)
print("Training performance:")
xgboost_random_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.974 | 0.995 | 0.866 | 0.926 |
# creating confusion matrix
confusion_matrix_sklearn(xgb_tuned2, X_train, y_train)
# Calculating different metrics on validation set
xgboost_random_val = model_performance_classification_sklearn(xgb_tuned2, X_val, y_val)
print("Validation performance:")
xgboost_random_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.957 | 0.939 | 0.820 | 0.876 |
# creating confusion matrix
confusion_matrix_sklearn(xgb_tuned2, X_val, y_val)
Observation:
# print counts of target variable before oversampling
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
# fit and resample the training data
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
# prit data size after the oversampling
print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, counts of label '1': 976 Before OverSampling, counts of label '0': 5099 After OverSampling, counts of label '1': 5099 After OverSampling, counts of label '0': 5099 After OverSampling, the shape of train_X: (10198, 29) After OverSampling, the shape of train_y: (10198,)
print("Before OverSampling, counts of label '1': {}".format(sum(y_val == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_val == 0)))
sm = SMOTE(
sampling_strategy=1, k_neighbors=5, random_state=1
) # Synthetic Minority Over Sampling Technique
X_val_over, y_val_over = sm.fit_resample(X_val, y_val)
print("After OverSampling, counts of label '1': {}".format(sum(y_val_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_val_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_val_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_val_over.shape))
Before OverSampling, counts of label '1': 326 Before OverSampling, counts of label '0': 1700 After OverSampling, counts of label '1': 1700 After OverSampling, counts of label '0': 1700 After OverSampling, the shape of train_X: (3400, 29) After OverSampling, the shape of train_y: (3400,)
# fit the model
model = GradientBoostingClassifier(random_state=1)
gbost = model.fit(X_train_over, y_train_over)
# Calculating different metrics on train set
gradient_over = model_performance_classification_sklearn(
gbost, X_train_over, y_train_over
)
print("Training performance:")
gradient_over
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.981 | 0.981 | 0.981 | 0.981 |
# creating confusion matrix
confusion_matrix_sklearn(gbost, X_train_over, y_train_over)
# fit the model
model = GradientBoostingClassifier(random_state=1)
gbosv = model.fit(X_val_over, y_val_over)
# Calculating different metrics on train set
gradient_over_val = model_performance_classification_sklearn(
gbosv, X_val_over, y_val_over
)
print("Validation performance:")
gradient_over_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.992 | 0.994 | 0.991 | 0.992 |
# creating confusion matrix
confusion_matrix_sklearn(gbosv, X_val_over, y_val_over)
# fit the model
model = AdaBoostClassifier(random_state=1)
adost = model.fit(X_train_over, y_train_over)
# Calculating different metrics on train set
Adaboost_over = model_performance_classification_sklearn(
adost, X_train_over, y_train_over
)
print("Training performance:")
Adaboost_over
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.968 | 0.969 | 0.967 | 0.968 |
# creating confusion matrix
confusion_matrix_sklearn(adost, X_val_over, y_val_over)
# fit the model
model = AdaBoostClassifier(random_state=1)
adosv = model.fit(X_val_over, y_val_over)
# Calculating different metrics on train set
Adaboost_over_val = model_performance_classification_sklearn(
adosv, X_val_over, y_val_over
)
print("Validation performance:")
Adaboost_over_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.974 | 0.976 | 0.971 | 0.974 |
# creating confusion matrix
confusion_matrix_sklearn(adosv, X_val_over, y_val_over)
# defining model
model = XGBClassifier(random_state=1, eval_metric="logloss")
xgbost = model.fit(X_train_over, y_train_over)
# Calculating different metrics on train set
xgboost_over = model_performance_classification_sklearn(
xgbost, X_train_over, y_train_over
)
print("Training performance:")
xgboost_over
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
# creating confusion matrix
confusion_matrix_sklearn(xgbost, X_train_over, y_train_over)
# fit the model
model = XGBClassifier(random_state=1, eval_metric="logloss")
xgbosv = model.fit(X_val_over, y_val_over)
# Calculating different metrics on train set
xgboost_over_val = model_performance_classification_sklearn(
xgbosv, X_val_over, y_val_over
)
print("Validation performance:")
xgboost_over_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
# creating confusion matrix
confusion_matrix_sklearn(xgbosv, X_val_over, y_val_over)
# define undersample strategy
undersample = RandomUnderSampler(sampling_strategy=0.5)
# fit and apply the transform
X_under, y_under = undersample.fit_resample(X_train, y_train)
display(X_train.shape)
display(X_under.shape)
(6075, 29)
(2928, 29)
# define undersample strategy
undersample = RandomUnderSampler(sampling_strategy=0.5)
# fit and apply the transform
X_under_val, y_under_val = undersample.fit_resample(X_val, y_val)
display(X_val.shape)
display(X_under_val.shape)
(2026, 29)
(978, 29)
# fit the model
model = GradientBoostingClassifier(random_state=1)
gbust = model.fit(X_under, y_under)
# Calculating different metrics on train set
gradient_under = model_performance_classification_sklearn(gbust, X_under, y_under)
print("Training performance:")
gradient_under
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.971 | 0.950 | 0.962 | 0.956 |
# creating confusion matrix
confusion_matrix_sklearn(gbust, X_under, y_under)
# fit the model
model = GradientBoostingClassifier(random_state=1)
gbusv = model.fit(X_under_val, y_under_val)
# Calculating different metrics on train set
gradient_under_val = model_performance_classification_sklearn(
gbusv, X_under_val, y_under_val
)
print("Validation performance:")
gradient_under_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.991 | 0.985 | 0.988 | 0.986 |
# creating confusion matrix
confusion_matrix_sklearn(gbusv, X_under_val, y_under_val)
# fit the model
model = AdaBoostClassifier(random_state=1)
adust = model.fit(X_under, y_under)
# Calculating different metrics on train set
Adaboost_under = model_performance_classification_sklearn(adust, X_under, y_under)
print("Training performance:")
Adaboost_under
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.946 | 0.916 | 0.921 | 0.918 |
# creating confusion matrix
confusion_matrix_sklearn(adust, X_under, y_under)
# fit the model
model = AdaBoostClassifier(random_state=1)
adusv = model.fit(X_under_val, y_under_val)
# Calculating different metrics on train set
Adaboost_under_val = model_performance_classification_sklearn(
adusv, X_under_val, y_under_val
)
print("Validation performance:")
Adaboost_under_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.961 | 0.936 | 0.947 | 0.941 |
# creating confusion matrix
confusion_matrix_sklearn(adusv, X_under_val, y_under_val)
# defining model
model = XGBClassifier(random_state=1, eval_metric="logloss")
xgbust = model.fit(X_under, y_under)
display(xgbust)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=None, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=None,
n_jobs=None, num_parallel_tree=None, random_state=1, ...)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=None, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=None,
n_jobs=None, num_parallel_tree=None, random_state=1, ...)# Calculating different metrics on train set
xgboost_under = model_performance_classification_sklearn(xgbust, X_under, y_under)
print("Training performance:")
xgboost_under
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
# creating confusion matrix
confusion_matrix_sklearn(xgbust, X_under, y_under)
# defining model
model = XGBClassifier(random_state=1, eval_metric="logloss")
xgbusv = model.fit(X_under_val, y_under_val)
# Calculating different metrics on train set
xgboost_under_val = model_performance_classification_sklearn(
xgbusv, X_under_val, y_under_val
)
print("Validation performance:")
xgboost_under_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
# creating confusion matrix
confusion_matrix_sklearn(xgbusv, X_under_val, y_under_val)
# concatenate the dataframes
models_train_comp_df = pd.concat(
[
gradient_grid_train.T,
gradient_grid_val.T,
gradient_random_train.T,
gradient_random_val.T,
gradient_over.T,
gradient_over_val.T,
gradient_under.T,
gradient_under_val.T,
Adaboost_grid_train.T,
Adaboost_grid_val.T,
Adaboost_random_train.T,
Adaboost_random_val.T,
Adaboost_over.T,
Adaboost_over_val.T,
Adaboost_under.T,
Adaboost_under_val.T,
xgboost_grid_train.T,
xgboost_grid_val.T,
xgboost_random_train.T,
xgboost_random_val.T,
xgboost_over.T,
xgboost_over_val.T,
xgboost_under.T,
xgboost_under_val.T,
],axis=1
)
# give the columns headers
models_train_comp_df.columns = [
"GradientBoost Tuned with GridSearchCV Train Set",
"GradientBoost Tuned with GridSearchCV Validatn Set",
"GradientBoost Tuned with RandomSearchCV Train Set",
"GradientBoost Tuned with RandomSearchCV Validatn Set",
"GradientBoost Tuned with Oversampled Train Set",
"GradientBoost Tuned with Oversampled Validatn Set",
"GradientBoost Tuned with Undersampled Train Set",
"GradientBoost Tuned with Undersampled Validatn Set",
"AdaBoost Tuned with GridSearchCV Train Set",
"AdaBoost Tuned with GridSearchCV Validatn Set",
"AdaBoost Tuned with RandomSearchCV Train Set",
"AdaBoost Tuned with RandomSearchCV Validatn Set",
"AdaBoost Tuned with Oversampled Train Set",
"AdaBoost Tuned with Oversampled Validatn Set",
"AdaBoost Tuned with Undersampled Train Set",
"AdaBoost Tuned with Undersampled Validatn Set",
"Xgboost Tuned with GridSearchCV Train Set",
"Xgboost Tuned with GridSearchCV Validatn Set",
"Xgboost Tuned with RandomSearchCV Train Set",
"Xgboost Tuned with RandomSearchCV Validatn Set",
"Xgboost Tuned with Oversampled Train Set",
"Xgboost Tuned with Oversampled Validatn Set",
"Xgboost Tuned with Undersampled Train Set",
"Xgboost Tuned with Undersampled Validatn Set",
]
# print the dataframe
print("Training & Validation performance comparison:")
print("*" * 45)
models_train_comp_df
Training & Validation performance comparison: *********************************************
| GradientBoost Tuned with GridSearchCV Train Set | GradientBoost Tuned with GridSearchCV Validatn Set | GradientBoost Tuned with RandomSearchCV Train Set | GradientBoost Tuned with RandomSearchCV Validatn Set | GradientBoost Tuned with Oversampled Train Set | GradientBoost Tuned with Oversampled Validatn Set | GradientBoost Tuned with Undersampled Train Set | GradientBoost Tuned with Undersampled Validatn Set | AdaBoost Tuned with GridSearchCV Train Set | AdaBoost Tuned with GridSearchCV Validatn Set | AdaBoost Tuned with RandomSearchCV Train Set | AdaBoost Tuned with RandomSearchCV Validatn Set | AdaBoost Tuned with Oversampled Train Set | AdaBoost Tuned with Oversampled Validatn Set | AdaBoost Tuned with Undersampled Train Set | AdaBoost Tuned with Undersampled Validatn Set | Xgboost Tuned with GridSearchCV Train Set | Xgboost Tuned with GridSearchCV Validatn Set | Xgboost Tuned with RandomSearchCV Train Set | Xgboost Tuned with RandomSearchCV Validatn Set | Xgboost Tuned with Oversampled Train Set | Xgboost Tuned with Oversampled Validatn Set | Xgboost Tuned with Undersampled Train Set | Xgboost Tuned with Undersampled Validatn Set | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.973 | 0.970 | 0.972 | 0.968 | 0.981 | 0.992 | 0.971 | 0.991 | 0.983 | 0.971 | 0.983 | 0.971 | 0.968 | 0.974 | 0.946 | 0.961 | 0.980 | 0.959 | 0.974 | 0.957 | 1.000 | 1.000 | 1.000 | 1.000 |
| Recall | 0.874 | 0.868 | 0.866 | 0.856 | 0.981 | 0.994 | 0.950 | 0.985 | 0.926 | 0.874 | 0.926 | 0.874 | 0.969 | 0.976 | 0.916 | 0.936 | 0.998 | 0.942 | 0.995 | 0.939 | 1.000 | 1.000 | 1.000 | 1.000 |
| Precision | 0.955 | 0.943 | 0.955 | 0.943 | 0.981 | 0.991 | 0.962 | 0.988 | 0.966 | 0.941 | 0.966 | 0.941 | 0.967 | 0.971 | 0.921 | 0.947 | 0.890 | 0.825 | 0.866 | 0.820 | 1.000 | 1.000 | 1.000 | 1.000 |
| F1 | 0.913 | 0.904 | 0.908 | 0.897 | 0.981 | 0.992 | 0.956 | 0.986 | 0.946 | 0.906 | 0.946 | 0.906 | 0.968 | 0.974 | 0.918 | 0.941 | 0.941 | 0.880 | 0.926 | 0.876 | 1.000 | 1.000 | 1.000 | 1.000 |
# create a dataframe from the confusion matrix false negative scores
defaultdict = {
"Gradient Boost Tuned with Gridsearch Training Set": 123,
"Gradient Boost Tuned with Gridsearch Validation Set": 43,
"Gradient Boost Tuned with Randomsearch Training Set": 131,
"Gradient Boost Tuned with Randomsearch Validation Set": 47,
"Gradient Boost Tuned with Oversampled Training Set": 99,
"Gradient Boost Tuned with Oversampled Validation Set": 11,
"Gradient Boost Tuned with Undersampled Training Set": 41,
"Gradient Boost Tuned with Undersampled Validation Set": 3,
"AdaBoost Tuned with Gridsearch Training Set": 72,
"AdaBoost Tuned with Gridsearch Validation Set": 41,
"AdaBoost Tuned with Randomsearch Training Set": 72,
"AdaBoost Tuned with Randomsearch Validation Set": 41,
"AdaBoost Tuned with Oversampled Training Set": 57,
"AdaBoost Tuned with Oversampled Validation Set": 41,
"AdaBoost Tuned with Undersampled Training Set": 78,
"AdaBoost Tuned with Undersampled Validation Set": 13,
"Xgboost Tuned with Gridsearch Training Set": 2,
"Xgboost Tuned with Gridsearch Validation Set": 19,
"Xgboost Tuned with RandomSearch Training Set": 5,
"Xgboost Tuned with RandomSearch Validation Set": 20,
"Xgboost Tuned with Oversampled Training Set": 0,
"Xgboost Tuned with Oversampled Validation Set": 0,
"Xgboost Tuned with Undersampled Training Set": 0,
"Xgboost Tuned with Undersampled Validation Set": 0,
}
false_negatives_df_train = pd.DataFrame.from_dict(
defaultdict, orient="index", columns=["False Negatives"]
)
display(false_negatives_df_train.sort_values("False Negatives"))
| False Negatives | |
|---|---|
| Xgboost Tuned with Undersampled Validation Set | 0 |
| Xgboost Tuned with Oversampled Validation Set | 0 |
| Xgboost Tuned with Oversampled Training Set | 0 |
| Xgboost Tuned with Undersampled Training Set | 0 |
| Xgboost Tuned with Gridsearch Training Set | 2 |
| Gradient Boost Tuned with Undersampled Validation Set | 3 |
| Xgboost Tuned with RandomSearch Training Set | 5 |
| Gradient Boost Tuned with Oversampled Validation Set | 11 |
| AdaBoost Tuned with Undersampled Validation Set | 13 |
| Xgboost Tuned with Gridsearch Validation Set | 19 |
| Xgboost Tuned with RandomSearch Validation Set | 20 |
| AdaBoost Tuned with Oversampled Validation Set | 41 |
| AdaBoost Tuned with Randomsearch Validation Set | 41 |
| AdaBoost Tuned with Gridsearch Validation Set | 41 |
| Gradient Boost Tuned with Undersampled Training Set | 41 |
| Gradient Boost Tuned with Gridsearch Validation Set | 43 |
| Gradient Boost Tuned with Randomsearch Validation Set | 47 |
| AdaBoost Tuned with Oversampled Training Set | 57 |
| AdaBoost Tuned with Randomsearch Training Set | 72 |
| AdaBoost Tuned with Gridsearch Training Set | 72 |
| AdaBoost Tuned with Undersampled Training Set | 78 |
| Gradient Boost Tuned with Oversampled Training Set | 99 |
| Gradient Boost Tuned with Gridsearch Training Set | 123 |
| Gradient Boost Tuned with Randomsearch Training Set | 131 |
# create a dataframe from the confusion matrix RECALL scores
defaultdict = {
"Gradient Boost Tuned with Gridsearch Training Set": 0.874,
"Gradient Boost Tuned with Gridsearch Validation Set": 0.868,
"Gradient Boost Tuned with Randomsearch Training Set": 0.866,
"Gradient Boost Tuned with Randomsearch Validation Set": 0.856,
"Gradient Boost Tuned with Oversampled Training Set": 0.981,
"Gradient Boost Tuned with Oversampled Validation Set": 0.994,
"Gradient Boost Tuned with Undersampled Training Set": 0.956,
"Gradient Boost Tuned with Undersampled Validation Set": 0.968,
"AdaBoost Tuned with Gridsearch Training Set": 0.926,
"AdaBoost Tuned with Gridsearch Validation Set": 0.874,
"AdaBoost Tuned with Randomsearch Training Set": 0.926,
"AdaBoost Tuned with Randomsearch Validation Set": 0.874,
"AdaBoost Tuned with Oversampled Training Set": 0.969,
"AdaBoost Tuned with Oversampled Validation Set": 0.976,
"AdaBoost Tuned with Undersampled Training Set": 0.910,
"AdaBoost Tuned with Undersampled Validation Set": 0.966,
"Xgboost Tuned with Gridsearch Training Set": 0.998,
"Xgboost Tuned with Gridsearch Validation Set": 0.942,
"Xgboost Tuned with RandomSearch Training Set": 0.995,
"Xgboost Tuned with RandomSearch Validation Set": 0.935,
"Xgboost Tuned with Oversampled Training Set": 1.000,
"Xgboost Tuned with Oversampled Validation Set": 1.000,
"Xgboost Tuned with Undersampled Training Set": 1.000,
"Xgboost Tuned with Undersampled Validation Set": 1.000,
}
false_negatives_df_train = pd.DataFrame.from_dict(
defaultdict, orient="index", columns=["Recall"]
)
display(false_negatives_df_train.sort_values("Recall"))
| Recall | |
|---|---|
| Gradient Boost Tuned with Randomsearch Validation Set | 0.856 |
| Gradient Boost Tuned with Randomsearch Training Set | 0.866 |
| Gradient Boost Tuned with Gridsearch Validation Set | 0.868 |
| Gradient Boost Tuned with Gridsearch Training Set | 0.874 |
| AdaBoost Tuned with Gridsearch Validation Set | 0.874 |
| AdaBoost Tuned with Randomsearch Validation Set | 0.874 |
| AdaBoost Tuned with Undersampled Training Set | 0.910 |
| AdaBoost Tuned with Gridsearch Training Set | 0.926 |
| AdaBoost Tuned with Randomsearch Training Set | 0.926 |
| Xgboost Tuned with RandomSearch Validation Set | 0.935 |
| Xgboost Tuned with Gridsearch Validation Set | 0.942 |
| Gradient Boost Tuned with Undersampled Training Set | 0.956 |
| AdaBoost Tuned with Undersampled Validation Set | 0.966 |
| Gradient Boost Tuned with Undersampled Validation Set | 0.968 |
| AdaBoost Tuned with Oversampled Training Set | 0.969 |
| AdaBoost Tuned with Oversampled Validation Set | 0.976 |
| Gradient Boost Tuned with Oversampled Training Set | 0.981 |
| Gradient Boost Tuned with Oversampled Validation Set | 0.994 |
| Xgboost Tuned with RandomSearch Training Set | 0.995 |
| Xgboost Tuned with Gridsearch Training Set | 0.998 |
| Xgboost Tuned with Oversampled Training Set | 1.000 |
| Xgboost Tuned with Oversampled Validation Set | 1.000 |
| Xgboost Tuned with Undersampled Training Set | 1.000 |
| Xgboost Tuned with Undersampled Validation Set | 1.000 |
Best Model Choice
The most suitable model evaluation metrics, in this case, are recall, using recall because it needs to take into account the percentage of False Negative (FN) that must be minimized.
The best model obtained is XGBoost Tuned with RandomSearch Training Set (xgb_tuned2) with recall scores of 0.995 and FN of 5 in training data, respectively.
Choose the model that has Recall & Precision > 0.95 and find test accuracy > 0.70 on test set.
# Calculating different metrics on the test set
gbsov_test = model_performance_classification_sklearn(gbosv, X_test, y_test)
print("GradientBoost Tuned with Oversampled Validation Set: Test performance:")
gbsov_test
GradientBoost Tuned with Oversampled Validation Set: Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.954 | 0.874 | 0.843 | 0.858 |
# Calculating different metrics on the test set
adaboost_test = model_performance_classification_sklearn(adb_tuned2, X_test, y_test)
print("AdaBoost Tuned with Random search Training Set: Test performance:")
adaboost_test
AdaBoost Tuned with Random search Training Set: Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.976 | 0.908 | 0.939 | 0.923 |
# Calculating different metrics on the test set
gbusv_test = model_performance_classification_sklearn(gbusv, X_test, y_test)
print("GradientBoost Tuned with Undersampled Validation Set: Test performance:")
gbusv_test
GradientBoost Tuned with Undersampled Validation Set: Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.955 | 0.926 | 0.818 | 0.869 |
None of these models performed above 95% on Recall and Precision and above 70% in accuracy. Let's go back to the comparison frame and try different models.
# Calculating different metrics on the test set
xgboostrv_test = model_performance_classification_sklearn(xgb_tuned2, X_test, y_test)
print("XGBoost with XGBoost Random Sampling Data: Test performance:")
xgboostrv_test
XGBoost with XGBoost Random Sampling Data: Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.956 | 0.951 | 0.809 | 0.874 |
# creating confusion matrix
confusion_matrix_sklearn(xgb_tuned2, X_test, y_test)
Observation:
This model is giving us the accuracy, recall and precision we need, along with the low false negative error rate, so that we can use it in production
Features of Importance
Let's extract the feature importance ranking from this model so that we can where to focus our analytical efforts when examining target variable i.e. credit card churners
# get list of features from dmodel
feature_names = X.columns
importances = xgb_tuned2.feature_importances_
indices = np.argsort(importances)
# plot the features
plt.figure(figsize=(12, 12))
plt.title("XGBoost RandomSearchCV Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observation:
The most important features to understand customer credit card churn, are
All of these features are negatively correlated with the Attrition Flag, meaning, the lower the values of these features, the higher the chances of a customer to attrite
# create a scatterplot
sns.scatterplot(x=data.Total_Trans_Ct, y=card.Attrition_Flag)
plt.show()
# create a scatterplot
sns.scatterplot(x=data.Total_Revolving_Bal, y=card.Attrition_Flag)
<Axes: xlabel='Total_Revolving_Bal', ylabel='Attrition_Flag'>
# crosstabs
pd.crosstab(index=data["Attrition_Flag"], columns=data["Total_Relationship_Count"])
| Total_Relationship_Count | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Attrition_Flag | ||||||
| 0 | 677 | 897 | 1905 | 1687 | 1664 | 1670 |
| 1 | 233 | 346 | 400 | 225 | 227 | 196 |
# create a scatterplot
sns.scatterplot(
x=data.Total_Relationship_Count,
y=data.Total_Relationship_Count,
hue=data["Attrition_Flag"],
)
plt.show()
We know that we can use pipelines to standardize the model building, but the steps in a pipeline are applied to each and every variable - how can we personalize the pipeline to perform different processing on different columns?
Column transformer allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data in order to combine several feature extraction mechanisms or transformations into a single transformer.
We will create 2 different pipelines, one for numerical columns and one for categorical columns
For numerical columns, we will do missing value imputation as pre-processing
For categorical columns, we will do one hot encoding and missing value imputation as pre-processing
We are doing missing value imputation for the whole data, so that if there is any missing value in the data in future that can be taken care of.
#View an incoming data frame to be put through pipeline
test_data = data.copy()
type(data)
test_data.head()
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 0 | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 0 | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 0 | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 0 | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
# Separating target variable and other variables
X = test_data.drop(columns=["Attrition_Flag"])
Y = test_data["Attrition_Flag"]
# check head
X.head()
| Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
Now we already know the best model we need to process with, so we don't need to divide data into 3 parts
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)
display(X_train)
display(y_train)
(7088, 19) (3039, 19)
| Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4124 | 50 | F | 1 | Graduate | Married | Less than $40K | Blue | 43 | 6 | 1 | 2 | 7985.000 | 0 | 7985.000 | 1.032 | 3873 | 72 | 0.674 | 0.000 |
| 4686 | 50 | M | 0 | NaN | Divorced | $60K - $80K | Blue | 36 | 3 | 3 | 2 | 5444.000 | 2499 | 2945.000 | 0.468 | 4509 | 80 | 0.667 | 0.459 |
| 1276 | 26 | F | 0 | Graduate | Single | $40K - $60K | Blue | 13 | 6 | 3 | 4 | 1643.000 | 1101 | 542.000 | 0.713 | 2152 | 50 | 0.471 | 0.670 |
| 6119 | 65 | F | 0 | College | Single | Less than $40K | Blue | 55 | 3 | 3 | 0 | 2022.000 | 0 | 2022.000 | 0.579 | 4623 | 65 | 0.548 | 0.000 |
| 2253 | 46 | M | 3 | Graduate | Single | $80K - $120K | Blue | 35 | 6 | 3 | 4 | 4930.000 | 0 | 4930.000 | 1.019 | 3343 | 77 | 0.638 | 0.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4581 | 50 | F | 2 | Graduate | NaN | Less than $40K | Blue | 36 | 5 | 2 | 2 | 1438.300 | 0 | 1438.300 | 0.658 | 2329 | 43 | 0.593 | 0.000 |
| 9338 | 56 | M | 0 | Doctorate | Single | $80K - $120K | Blue | 47 | 2 | 1 | 2 | 7204.000 | 0 | 7204.000 | 0.803 | 14042 | 113 | 0.766 | 0.000 |
| 9950 | 45 | M | 5 | Uneducated | Single | $120K + | Silver | 36 | 4 | 2 | 1 | 34516.000 | 0 | 34516.000 | 0.732 | 8603 | 84 | 0.615 | 0.000 |
| 1784 | 35 | M | 0 | High School | Single | $80K - $120K | Gold | 29 | 6 | 2 | 3 | 34516.000 | 1965 | 32551.000 | 1.044 | 2949 | 70 | 1.000 | 0.057 |
| 4752 | 41 | F | 4 | Uneducated | NaN | Less than $40K | Blue | 32 | 4 | 3 | 1 | 3189.000 | 0 | 3189.000 | 0.910 | 4813 | 86 | 0.755 | 0.000 |
7088 rows × 19 columns
4124 0
4686 0
1276 0
6119 0
2253 0
..
4581 1
9338 0
9950 1
1784 0
4752 0
Name: Attrition_Flag, Length: 7088, dtype: int64
# creating a list of numerical variables
numerical_features = [
"Customer_Age",
"Dependent_count",
"Months_on_book",
"Total_Relationship_Count",
"Months_Inactive_12_mon",
"Contacts_Count_12_mon",
"Credit_Limit",
"Total_Revolving_Bal",
"Avg_Open_To_Buy",
"Total_Amt_Chng_Q4_Q1",
"Total_Trans_Amt",
"Total_Trans_Ct",
"Total_Ct_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
]
# creating a transformer for numerical variables, which will apply simple imputer on the numerical variables
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
# creating a list of categorical variables
categorical_features = [
"Gender",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category",
]
# creating a transformer for categorical variables, which will first apply simple imputer and
# then do one hot encoding for categorical variables
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")), # handle missing
("onehot", OneHotEncoder(handle_unknown="ignore")),
]
)
# handle_unknown = "ignore", allows model to handle any unknown category in the test data
# combining categorical transformer and numerical transformer using a column transformer
preprocessor = ColumnTransformer(
transformers=[ # List of (name, transformer, columns)
("num_step", numeric_transformer, numerical_features),
("cat_step", categorical_transformer, categorical_features),
],
remainder="passthrough",
n_jobs=-1,
verbose=True,
)
# remainder = "passthrough" has been used, it will allow variables that are present in original data
# but not in "numerical_columns" and "categorical_columns" to pass through the column transformer without any changes
# Creating new pipeline with best parameters
production_model = Pipeline(
steps=[
("pre", preprocessor), # pipelines from above
(
"XGB", # best model for prediction
XGBClassifier(
base_score=0.5,
booster="gbtree",
colsample_bylevel=1,
colsample_bynode=1,
colsample_bytree=1,
eval_metric="logloss",
gamma=0,
gpu_id=-1,
importance_type="gain",
interaction_constraints="",
learning_rate=0.300000012,
max_delta_step=0,
max_depth=6,
min_child_weight=1,
missing=np.nan,
monotone_constraints="()",
n_estimators=100,
n_jobs=4,
num_parallel_tree=1,
random_state=1,
reg_alpha=0,
reg_lambda=1,
scale_pos_weight=1,
subsample=1,
tree_method="exact",
validate_parameters=1,
verbosity=None,
),
),
]
)
# view pipeline
print(production_model[0])
display(type(production_model))
ColumnTransformer(n_jobs=-1, remainder='passthrough',
transformers=[('num_step',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median'))]),
['Customer_Age', 'Dependent_count',
'Months_on_book', 'Total_Relationship_Count',
'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit',
'Total_Revolving_Bal', 'Avg_Open_To_Buy',
'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1',
'Avg_Utilization_Ratio']),
('cat_step',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='most_frequent')),
('onehot',
OneHotEncoder(handle_unknown='ignore'))]),
['Gender', 'Education_Level', 'Marital_Status',
'Income_Category', 'Card_Category'])],
verbose=True)
sklearn.pipeline.Pipeline
# Fit the model on training data
production=production_model.fit(X_test, y_test.values)
print(production)
Pipeline(steps=[('pre',
ColumnTransformer(n_jobs=-1, remainder='passthrough',
transformers=[('num_step',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median'))]),
['Customer_Age',
'Dependent_count',
'Months_on_book',
'Total_Relationship_Count',
'Months_Inactive_12_mon',
'Contacts_Count_12_mon',
'Credit_Limit',
'Total_Revolving_Bal',
'Avg_Open_To_Buy',
'Tot...
feature_types=None, gamma=0, gpu_id=-1,
grow_policy=None, importance_type='gain',
interaction_constraints='',
learning_rate=0.300000012, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=0, max_depth=6, max_leaves=None,
min_child_weight=1, missing=nan,
monotone_constraints='()', multi_strategy=None,
n_estimators=100, n_jobs=4, num_parallel_tree=1, ...))])
attrition_param = pd.DataFrame({'Attrition Parameters ':['Customer Age', 'Gender', 'Dependents_No','Educational Level', 'Marital Status','Income Level',
'Card type','Months of Book','Total_Relationship_Count','Credit_Limit','Total_Revolving_Bal',
'Avg_Open_To_Buy','Total_Amt_Chng_Q4_Q1','Total_Trans_Amt','Total_Trans_Ct','Total_Ct_Chng_Q4_Q1','Avg_Utilization_Ratio', 'No of Products']
,'Indicators':['43 to 48 years old','Females','3 Dependents','Graduate','Married','Less than $40k','Blue & Platinum Cardholders',
'36 to 39 months','3 Contacts Counts_12_mon','$1,438.30','$0.00','$1,438.30','0.6%','$2108-$2312','40-44','0.5-0.6%','0.00%', '2-3 products']}
,index=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18])
print(' Attrition Indicators')
print("*" *22)
attrition_param
Attrition Indicators **********************
| Attrition Parameters | Indicators | |
|---|---|---|
| 1 | Customer Age | 43 to 48 years old |
| 2 | Gender | Females |
| 3 | Dependents_No | 3 Dependents |
| 4 | Educational Level | Graduate |
| 5 | Marital Status | Married |
| 6 | Income Level | Less than $40k |
| 7 | Card type | Blue & Platinum Cardholders |
| 8 | Months of Book | 36 to 39 months |
| 9 | Total_Relationship_Count | 3 Contacts Counts_12_mon |
| 10 | Credit_Limit | $1,438.30 |
| 11 | Total_Revolving_Bal | $0.00 |
| 12 | Avg_Open_To_Buy | $1,438.30 |
| 13 | Total_Amt_Chng_Q4_Q1 | 0.6% |
| 14 | Total_Trans_Amt | $2108-$2312 |
| 15 | Total_Trans_Ct | 40-44 |
| 16 | Total_Ct_Chng_Q4_Q1 | 0.5-0.6% |
| 17 | Avg_Utilization_Ratio | 0.00% |
| 18 | No of Products | 2-3 products |
From the attritable parameter table above, the following have been deduced:
Female bank customers churn the most with a percentage of 11.4% compared to males who have a percentage of 9%.
Bank customers between the ages of 43 and 48 age group have a higher churn percentage than other age groups in the dataset.
Customers who are married and have 3 nos of dependents are going to churn
Bank customers with Blue and Platinum card types churn more with a percentage of 5.5% than others.
Customers whose educational level is Graduate are more likely to attrite than others.
Bank customers who have 36 to 39 months on book are more likely to churn.
Customers who have less than $40000 per anum income level are more likely to churn than others
Bank customer who have had 3 number of contact counts are more than likely to churn.
Bank customers whose credit limit and average open to buy amount is around $1430 are more likely to churn
Bank customers who purchased 2-3 products through the bank have a larger churn percentage than other categories
We need to take note of customers who have between $2108 and $2312 total volume of transaction, because they are more likely to churn.
Customers who have between 40-44 counts of transactions in the review period are more likely to churn.
Bank should connect with the customer more often, and provide the customer with various offers and schemes to increase relationships of the customer with the bank
Bank should offer cashback schemes on credit cards on a wide spectrums that will include many stores and businesses, which might encourage the customers to continue using the credit card more often
Bank should also offer credit limit increase for the customers who are regularly using their credit cards. This would definitely increase the credit card spends/transaction amounts.
0% interest EMI (equated monthly installment) on credit card is also another offer that can be provided to customers to encourage them to buy products of higher cost with their credit card, and convert the expenditure to EMI. This will increase the total transaction amount as well as transaction counts of the customers in a given cut-off period. Further, this would have a simultaneous effect on the rovilving balance as well.
Along with the available types of cards, bank can introduce credit cards specifically directed to influence more online shopping (with certain percentage %ge cashback offers) or online food ordering on Doordash, Ubereats etc. This would geometrically increase the card usage per period and would make it nearly difficult for the customers to churn the card.
With our model, it is expediently possible to predict which customers are likely to attrite. Also according to the predicted probability, so arrangements should be made for the marketting department to reach through different advertising schemes to at least top 20-30% customers in order to show them new credit card offers, increased credit limit, newly opened cashback regimes et cetera. This can be used as a pilot program to try retain these customers and afterwards take it to everybody else.
!pip install nbconvert
Requirement already satisfied: nbconvert in /usr/local/lib/python3.10/dist-packages (6.5.4) Requirement already satisfied: lxml in /usr/local/lib/python3.10/dist-packages (from nbconvert) (4.9.3) Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (4.11.2) Requirement already satisfied: bleach in /usr/local/lib/python3.10/dist-packages (from nbconvert) (6.1.0) Requirement already satisfied: defusedxml in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.7.1) Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.4) Requirement already satisfied: jinja2>=3.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (3.1.2) Requirement already satisfied: jupyter-core>=4.7 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (5.5.0) Requirement already satisfied: jupyterlab-pygments in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.2.2) Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (2.1.3) Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.8.4) Requirement already satisfied: nbclient>=0.5.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.9.0) Requirement already satisfied: nbformat>=5.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (5.9.2) Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from nbconvert) (23.2) Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (1.5.0) Requirement already satisfied: pygments>=2.4.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (2.16.1) Requirement already satisfied: tinycss2 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (1.2.1) Requirement already satisfied: traitlets>=5.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (5.7.1) Requirement already satisfied: platformdirs>=2.5 in /usr/local/lib/python3.10/dist-packages (from jupyter-core>=4.7->nbconvert) (3.11.0) Requirement already satisfied: jupyter-client>=6.1.12 in /usr/local/lib/python3.10/dist-packages (from nbclient>=0.5.0->nbconvert) (6.1.12) Requirement already satisfied: fastjsonschema in /usr/local/lib/python3.10/dist-packages (from nbformat>=5.1->nbconvert) (2.18.1) Requirement already satisfied: jsonschema>=2.6 in /usr/local/lib/python3.10/dist-packages (from nbformat>=5.1->nbconvert) (4.19.2) Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4->nbconvert) (2.5) Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from bleach->nbconvert) (1.16.0) Requirement already satisfied: webencodings in /usr/local/lib/python3.10/dist-packages (from bleach->nbconvert) (0.5.1) Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert) (23.1.0) Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert) (2023.7.1) Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert) (0.30.2) Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert) (0.12.0) Requirement already satisfied: pyzmq>=13 in /usr/local/lib/python3.10/dist-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (23.2.1) Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.10/dist-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (2.8.2) Requirement already satisfied: tornado>=4.1 in /usr/local/lib/python3.10/dist-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (6.3.2)
%%shell
jupyter nbconvert --to html '/content/drive/My Drive/Project 3/Samson_Akomolafe_Machine_Learning_Project_3.ipynb'
[NbConvertApp] Converting notebook /content/drive/My Drive/Project 3/Samson_Akomolafe_Machine_Learning_Project_3.ipynb to html [NbConvertApp] Writing 4420318 bytes to /content/drive/My Drive/Project 3/Samson_Akomolafe_Machine_Learning_Project_3.html